The Best of All Worlds
Introduction
Data scientists always have had, and always will have, to wear many hats. Evangelist, stats nerd, ML practitioner, data storyteller, hacker, and problem solver are the archetypes that come to mind. In my own career I’ve worn all of these hats and a few more esoteric ones. What is unique to data science, in my opinion, is how many of these hats we have to wear at the same time, constantly transitioning from one to another.
From a more practical perspective, I’ve always noticed how much the programming / analysis / data visualisation environment changes depending on the hat I’m wearing for that hour, day, or month. When I first get my hands on some data it’s mostly SQL, then I start plotting, then comes pipeline construction, modelling, evaluation, and presentation, and then I iterate back and forth between all of those things.
It quickly became apparent that some languages are better than others at certain tasks. The declarative nature of SQL makes it easy to map domain concepts to data. In R, the tidyverse and mlverse packages are great for data visualisation, data wrangling, and machine learning (and arguably deep learning). Python’s flexibility makes it an excellent glue. When performance matters (and you can’t find it in the other languages), C++ is willing and able; JavaScript has taken the mantle as the lingua franca of data visualisation; and Scala (Spark) is synonymous with large-scale, data-intensive compute.
I’ve found, time and time again, that to do my best work with the hat I’m wearing I have had to use multiple languages for a single use case. In the early days this was incredibly difficult, but in recent years life has become noticeably easier for polyglot data scientists. Easy enough, in fact, to envision a future where polyglot data science is the norm.
A Polyglot World
Until very recently, data scientists seemed to be split into camps: Team Python and Team R, with a lot of debate and discourse around which language was ‘better’, and the conclusion almost always being that it depends on the task. Sometimes Python is better, sometimes R is better, so why not use both, leveraging the strengths of each for the task at hand? This idea of using the best language or tool for each task or subtask is where ‘polyglot’ comes from: poly = many, glot = tongue.
Not that hard
An argument against this polyglot approach to data science might be the effort it takes to ‘learn’ other languages, but this isn’t as tedious as it may seem, and the benefits are more than worth it.
For instance, one of the strengths of Python is that it’s easier for non-programmers to learn, but those with experience in a ‘curly brace’ language (C, C++, Java, C#, JavaScript) might say that R is easier because the syntax is superficially similar. Whatever the first language is, learning a second need not be an ordeal. The fundamental concepts are the same (expressions, sequence, selection, repetition), and in the context of data science you only need to learn a scope that meets your problem context. For example, my C++ is largely what’s called ‘numerical C++’, meaning I only use the parts of the language that are useful for numerical computing. I could not, say, write a high-performance enterprise service bus in C++, but I can implement something I read in a paper and have it run in a high-performance environment. JavaScript is another example: I am by no means a frontend web developer, but by learning ECharts, Vega, and Observable I can create interactive documents that communicate ideas to other data scientists, engineers, and business stakeholders in an informative, comprehensive, and impactful way.
And speaking of JavaScript…
Polyglot programming isn’t new. In late 2006 Neal Ford, Director and Software Architect at ThoughtWorks, coined the term “polyglot programming” to describe an approach in which systems are built using multiple languages, leveraging the strengths of each language for the components that would most benefit from them. Sound familiar? Ford predicted that we were then “entering a new era of software development” in which this paradigm would become more common. The clearest example of this is web development: web developers have been polyglots for a long time. A lesser example in the data world is SQL. You would be hard pressed to find a data/ML scientist or engineer who manages to avoid SQL entirely in favour of something else.
Putting it all together
Of course, there is a risk of making things more difficult by using multiple languages and tools. Ford was aware of this when he coined the term, saying:
“While polyglot programming will make some chores more difficult (like debugging), it makes others a lot easier. It’s all about choosing the right tool for the job and leveraging it correctly.”
And by all accounts Ford was correct. Environments from the enterprise to end-user computing are very much polyglot ones. From standard language-independent messaging protocols and data interchange formats to containerisation, polyglot systems have become ubiquitous in how we build and use software.
So what about data science? How can we apply this approach more purposefully? How can we realise the benefits?
Well, in a notebook environment, only one thing is needed to support polyglot programming:
“The ability to create and execute cells of different languages in the same notebook and [ideally] a way to pass data between them”
So you might have a SQL cell, then a Python cell, then an R cell, maybe a JS one, with some background wiring to pass data between them. Some notebook environments, like RStudio (specifically R Notebooks), Apache Zeppelin, the BeakerX extensions for Jupyter, Quarto (also from RStudio), and Datalore, support this kind of magic to a greater or lesser extent, and others will follow.
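To make that concrete, here is a minimal sketch in plain Python using the rpy2 package, which embeds an R session in the Python process. It assumes rpy2 and pandas are installed, and it simply stands in for the kind of background wiring a notebook does between a Python cell and an R cell:

```python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# The "Python cell": build an ordinary pandas dataframe
df = pd.DataFrame({"x": list(range(10)), "y": [2 * v + 1 for v in range(10)]})

# The "R cell": hand the dataframe to the embedded R session, fit a model,
# and pull the coefficients back into Python. The localconverter context
# tells rpy2 how to translate pandas dataframes to R data.frames and back.
with localconverter(ro.default_converter + pandas2ri.converter):
    ro.globalenv["df"] = df
    coefs = ro.r("coef(lm(y ~ x, data = df))")

print(list(coefs))  # intercept and slope, back on the Python side
```

A notebook hides this plumbing behind cell magics or chunk engines, but the shape of the exchange is the same: a dataframe goes in, R does what R is good at, and a result comes back.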
In addition, there are foundational projects that make that kind of functionality possible. Apache Arrow, for instance, is a low-level library that provides a language-independent, high-performance, in-memory columnar data format for passing data between applications and programming environments (or notebook cells) at runtime.
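As a rough illustration from the Python side (assuming pyarrow and pandas are installed), a dataframe can be converted to an Arrow table and written in the Arrow IPC / Feather format, which R’s arrow package can read back without any row-by-row re-serialisation:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.DataFrame({"city": ["Leeds", "York"], "visits": [120, 85]})

# Convert the pandas dataframe to an Arrow table: a columnar,
# language-independent representation of the same data
table = pa.Table.from_pandas(df)

# Write it in the Arrow IPC (Feather v2) format; in R,
# arrow::read_feather("visits.arrow") reads the same columns straight back
feather.write_feather(table, "visits.arrow")

# Round-trip it to show the table survives intact
print(feather.read_table("visits.arrow").to_pandas())
```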
The R and Python Arrow bindings, for instance, allow dataframes and other data to be passed between the two languages essentially for free (zero copy). There’s also the DuckDB project, an embedded, serverless SQL engine (like SQLite) that integrates with Arrow. This means that with DuckDB, a data scientist can run SQL queries against in-memory dataframes in R or Python (in a notebook, for example), as well as against files and native database tables.
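Here is a small sketch of that workflow, assuming the duckdb and pandas Python packages are installed; DuckDB’s replacement scans let the SQL refer to an in-memory dataframe by its variable name:

```python
import duckdb
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "amount": [100, 250, 80, 300],
})

# DuckDB can "see" local pandas dataframes by variable name,
# so 'sales' is queryable as if it were a table
totals = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").df()

print(totals)
```

The same query could just as easily target a Parquet file or a native DuckDB table, which is what makes it such a convenient SQL layer inside a notebook.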
Final Remarks
In the end, the idea of using the right tool for the right job and weaving those tools together is as applicable to data science as it is to enterprise software. From experience, it’s not as difficult as it might seem, and the benefits really are the best of all worlds.